Character Mapping and Ad-hoc Adaptation: Edinburgh's IWSLT 2020 Open Domain Translation System
This paper describes the University of Edinburgh’s neural machine translation systems submitted to the IWSLT 2020 open-domain Japanese-to-Chinese translation task. On top of commonplace techniques like tokenisation and corpus cleaning, we explore character mapping and unsupervised decoding-time adaptation. Our techniques focus on leveraging the provided data, and we show the positive impact of each technique through gradual improvements in BLEU.
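The character-mapping idea can be illustrated with a minimal preprocessing sketch: Japanese kanji variants are rewritten as their Chinese counterparts so that shared characters line up across the two scripts before tokenisation. The mapping table below is illustrative, not the one used in the paper.

```python
# Toy character-mapping table: Japanese (shinjitai) forms mapped to
# simplified Chinese forms. A real system would use a much larger,
# curated table; these three entries are only for illustration.
CHAR_MAP = {"図": "图", "気": "气", "広": "广"}

def map_characters(text, table=CHAR_MAP):
    """Replace each character that has a mapping; leave others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

mapped = map_characters("図気")  # both characters have table entries
```

Applied before subword tokenisation, such a mapping increases the vocabulary overlap between the source and target sides of a Japanese/Chinese corpus.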
Bilingual Document Alignment with Latent Semantic Indexing
We apply cross-lingual Latent Semantic Indexing to the Bilingual Document
Alignment Task at WMT16. Reduced-rank singular value decomposition of a
bilingual term-document matrix derived from known English/French page pairs in
the training data allows us to map monolingual documents into a joint semantic
space. Two variants of cosine similarity between the vectors that place each
document into the joint semantic space are combined with a measure of string
similarity between corresponding URLs to produce 1:1 alignments of
English/French web pages in a variety of domains. The system achieves a recall
of ca. 88% if no in-domain data is used for building the latent semantic model,
and 93% if such data is included.
Analysing the system's errors on the training data, we argue that evaluating
aligner performance based on exact URL matches underestimates their true
performance and propose an alternative that is able to account for duplicates
and near-duplicates in the underlying data.
Comment: Proceedings of the First Conference on Machine Translation (2016), Volume 2: Shared Task Papers
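The core mechanism, reduced-rank SVD of a bilingual term-document matrix followed by folding monolingual documents into the joint space, can be sketched as follows. The data here is random and the rank, vocabulary size, and fold-in formula are standard LSI choices, not details taken from the paper.

```python
import numpy as np

# Toy bilingual term-document matrix: rows are terms from a shared
# English+French vocabulary, columns are known English/French page pairs.
rng = np.random.default_rng(0)
term_doc = rng.random((50, 20))  # 50 bilingual terms, 20 training pairs

# Reduced-rank singular value decomposition.
k = 5
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
U_k, S_k = U[:, :k], S[:k]

def to_semantic_space(doc_vec):
    """Fold a monolingual document (term-count vector) into the joint
    k-dimensional semantic space (standard LSI fold-in: v = d^T U_k S_k^-1)."""
    return (doc_vec @ U_k) / S_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An English page and a French page with near-identical content should land
# close together in the joint space, regardless of surface language.
en_doc = rng.random(50)
fr_doc = en_doc + 0.05 * rng.random(50)  # near-duplicate content
sim = cosine(to_semantic_space(en_doc), to_semantic_space(fr_doc))
```

In the full system, this semantic similarity would be combined with URL string similarity before extracting 1:1 alignments.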
A Deterministic Dependency Parser for Japanese
We present a rule-based, deterministic dependency parser for Japanese. It was implemented in C++, using object classes that reflect linguistic concepts and thus facilitate the transfer of linguistic intuitions into code. The parser first chunks morphemes into one-word phrases and then parses from right to left. The average parsing accuracy is 83.6%.
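The right-to-left deterministic strategy can be sketched as a simple loop: each phrase scans rightward for the first later phrase an attachment rule accepts, falling back to the sentence-final phrase. The rule and the toy sentence below are illustrative stand-ins, not the parser's actual linguistic rules.

```python
# Minimal sketch of deterministic right-to-left dependency parsing over
# one-word phrases. In Japanese, every phrase depends on some phrase to
# its right, so the final phrase is the root.
def parse_right_to_left(phrases, can_attach):
    n = len(phrases)
    heads = [None] * n  # heads[i] = index of the phrase that i depends on
    for i in range(n - 2, -1, -1):       # right to left; last phrase is root
        for j in range(i + 1, n):        # candidate heads to the right
            if can_attach(phrases[i], phrases[j]):
                heads[i] = j
                break
        if heads[i] is None:
            heads[i] = n - 1             # default: attach to the final phrase
    return heads

# Toy rule: noun phrases attach to the next verb; anything else attaches
# to the next phrase. (Real rules would be far richer.)
phrases = [("taroo-ga", "noun"), ("hon-o", "noun"), ("yonda", "verb")]
heads = parse_right_to_left(
    phrases, lambda dep, head: head[1] == "verb" if dep[1] == "noun" else True
)
```

Because each attachment decision is made once and never revised, the parse is deterministic and runs in a single right-to-left pass.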
Thunderstorm nowcasting with deep learning: a multi-hazard data fusion model
Predictions of thunderstorm-related hazards are needed in several sectors,
including first responders, infrastructure management and aviation. To address
this need, we present a deep learning model that can be adapted to different
hazard types. The model can utilize multiple data sources; we use data from
weather radar, lightning detection, satellite visible/infrared imagery,
numerical weather prediction and digital elevation models. It can be trained to
operate with any combination of these sources, such that predictions can still
be provided if one or more of the sources become unavailable. We demonstrate
the ability of the model to predict lightning, hail and heavy precipitation
probabilistically on a 1 km resolution grid, with a time resolution of 5 min
and lead times up to 60 min. Shapley values quantify the importance of the
different data sources, showing that the weather radar products are the most
important predictors for all three hazard types.
Comment: 15 pages, 3 figures. Submitted to Geophysical Research Letters
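One common way to make a fusion model tolerate missing sources, consistent with the behaviour described above, is to randomly drop whole data sources during training so the network never relies on any single one. The sketch below illustrates that idea; the source names, shapes, and dropout rate are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

# Illustrative multi-source input: channels x height x width per source.
rng = np.random.default_rng(42)
sources = {
    "radar":     rng.random((1, 64, 64)),
    "lightning": rng.random((1, 64, 64)),
    "satellite": rng.random((3, 64, 64)),
}

def fuse_with_dropout(sources, drop_prob=0.3, rng=rng):
    """Stack sources channel-wise, zeroing each whole source with
    probability drop_prob, but never dropping all of them at once."""
    keep = {name: rng.random() >= drop_prob for name in sources}
    if not any(keep.values()):               # guarantee at least one source
        keep[rng.choice(list(sources))] = True
    blocks = [arr if keep[name] else np.zeros_like(arr)
              for name, arr in sources.items()]
    return np.concatenate(blocks, axis=0), keep

x, kept = fuse_with_dropout(sources)  # x feeds the network's input layer
```

At inference time, an unavailable source is simply zeroed the same way, so predictions can still be produced from whichever sources remain.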
The SUMMA Platform: Scalable Understanding of Multilingual Media
We present the latest version of the SUMMA platform, an open-source software platform for monitoring and interpreting multilingual media, from written news published on the internet to live media broadcasts via satellite or internet streaming. This work was conducted within the scope of the Research and Innovation Action SUMMA, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688139.